Renewable energy sources play an increasingly important role in the global energy mix as efforts to reduce the environmental impact of energy production intensify.
Out of all the renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S. Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.
Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable and if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.
The sensors fitted across different machines involved in the process of energy generation collect data related to various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, break, etc.).
“ReneWind” is a company working on improving the machinery and processes involved in the production of wind energy using machine learning, and it has collected sensor data on generator failures in wind turbines. They have shared a ciphered version of the data, as the data collected through sensors is confidential (the type of data collected varies between companies). The data has 40 predictors, with 20,000 observations in the training set and 5,000 in the test set.
The objective is to build various classification models, tune them, and find the best one to help identify failures so that the generators can be repaired before they fail or break, reducing the overall maintenance cost. The nature of predictions made by the classification model will translate as follows: a true positive (a correctly predicted failure) incurs a repair cost, a false negative (a missed failure) incurs a replacement cost, and a false positive (a false alarm) incurs an inspection cost.
It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of inspection is less than the cost of repair.
“1” in the target variable represents “failure” and “0” represents “no failure”.
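The cost ordering stated above (inspection < repair < replacement) can be expressed as a simple cost function over a model's confusion matrix. A minimal sketch follows; the unit costs are illustrative assumptions, not figures from the brief:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative relative costs (assumed, not given in the brief):
# a missed failure (FN) forces a replacement, a false alarm (FP)
# triggers an inspection, a caught failure (TP) needs a repair.
COST_REPLACE, COST_REPAIR, COST_INSPECT = 40, 15, 1

def maintenance_cost(y_true, y_pred):
    """Total maintenance cost implied by a set of predictions."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fn * COST_REPLACE + tp * COST_REPAIR + fp * COST_INSPECT

# Example: 2 missed failures, 1 false alarm, 1 caught failure
y_true = np.array([1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 1, 0])
print(maintenance_cost(y_true, y_pred))  # 2*40 + 1*15 + 1*1 = 96
```

Under this framing, reducing false negatives matters most, since each one carries the largest cost.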
# Installing the libraries with the specified version.
!pip install pandas==1.5.3 numpy==1.25.2 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 imbalanced-learn==0.10.1 xgboost==2.0.3 threadpoolctl==3.3.0 -q --user
Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer # Import SimpleImputer from sklearn.impute
from sklearn.pipeline import Pipeline
from sklearn.metrics import recall_score, confusion_matrix, classification_report, make_scorer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline
from xgboost import XGBClassifier
import matplotlib.pyplot as plt
import seaborn as sns
# Import necessary libraries
import pandas as pd
from google.colab import files
# Upload the datasets
uploaded = files.upload()
# Load the datasets into pandas DataFrames
train_df = pd.read_csv('Train.csv.csv')
test_df = pd.read_csv('Test.csv.csv')
Saving Test.csv.csv to Test.csv.csv Saving Train.csv.csv to Train.csv.csv
Display the first few rows of the training dataset
# Display the first few rows of the training dataset
print("\nFirst few rows of Training Dataset:")
display(train_df.head())
First few rows of Training Dataset:
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -4.464606 | -4.679129 | 3.101546 | 0.506130 | -0.221083 | -2.032511 | -2.910870 | 0.050714 | -1.522351 | 3.761892 | ... | 3.059700 | -1.690440 | 2.846296 | 2.235198 | 6.667486 | 0.443809 | -2.369169 | 2.950578 | -3.480324 | 0 |
| 1 | 3.365912 | 3.653381 | 0.909671 | -1.367528 | 0.332016 | 2.358938 | 0.732600 | -4.332135 | 0.565695 | -0.101080 | ... | -1.795474 | 3.032780 | -2.467514 | 1.894599 | -2.297780 | -1.731048 | 5.908837 | -0.386345 | 0.616242 | 0 |
| 2 | -3.831843 | -5.824444 | 0.634031 | -2.418815 | -1.773827 | 1.016824 | -2.098941 | -3.173204 | -2.081860 | 5.392621 | ... | -0.257101 | 0.803550 | 4.086219 | 2.292138 | 5.360850 | 0.351993 | 2.940021 | 3.839160 | -4.309402 | 0 |
| 3 | 1.618098 | 1.888342 | 7.046143 | -1.147285 | 0.083080 | -1.529780 | 0.207309 | -2.493629 | 0.344926 | 2.118578 | ... | -3.584425 | -2.577474 | 1.363769 | 0.622714 | 5.550100 | -1.526796 | 0.138853 | 3.101430 | -1.277378 | 0 |
| 4 | -0.111440 | 3.872488 | -3.758361 | -2.982897 | 3.792714 | 0.544960 | 0.205433 | 4.848994 | -1.854920 | -6.220023 | ... | 8.265896 | 6.629213 | -10.068689 | 1.222987 | -3.229763 | 1.686909 | -2.163896 | -3.644622 | 6.510338 | 0 |
5 rows × 41 columns
Display basic information of the Training Dataset
print("Training Dataset Info:")
display(train_df.info())
Training Dataset Info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 20000 entries, 0 to 19999 Data columns (total 41 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 V1 19982 non-null float64 1 V2 19982 non-null float64 2 V3 20000 non-null float64 3 V4 20000 non-null float64 4 V5 20000 non-null float64 5 V6 20000 non-null float64 6 V7 20000 non-null float64 7 V8 20000 non-null float64 8 V9 20000 non-null float64 9 V10 20000 non-null float64 10 V11 20000 non-null float64 11 V12 20000 non-null float64 12 V13 20000 non-null float64 13 V14 20000 non-null float64 14 V15 20000 non-null float64 15 V16 20000 non-null float64 16 V17 20000 non-null float64 17 V18 20000 non-null float64 18 V19 20000 non-null float64 19 V20 20000 non-null float64 20 V21 20000 non-null float64 21 V22 20000 non-null float64 22 V23 20000 non-null float64 23 V24 20000 non-null float64 24 V25 20000 non-null float64 25 V26 20000 non-null float64 26 V27 20000 non-null float64 27 V28 20000 non-null float64 28 V29 20000 non-null float64 29 V30 20000 non-null float64 30 V31 20000 non-null float64 31 V32 20000 non-null float64 32 V33 20000 non-null float64 33 V34 20000 non-null float64 34 V35 20000 non-null float64 35 V36 20000 non-null float64 36 V37 20000 non-null float64 37 V38 20000 non-null float64 38 V39 20000 non-null float64 39 V40 20000 non-null float64 40 Target 20000 non-null int64 dtypes: float64(40), int64(1) memory usage: 6.3 MB
None
Number of Entries: The training dataset contains 20,000 rows (RangeIndex 0 to 19999).
Number of Columns: There are 41 columns in total: 40 predictor variables (V1 to V40) and 1 target variable (Target).
Data Types: All 40 predictors are of type float64, suggesting that these features represent continuous or real-valued sensor data. The Target column is of type int64, indicating a categorical variable that encodes the classes for classification.
Non-Null Values: Every column is fully populated except V1 and V2, which each have 19,982 non-null entries, i.e., 18 missing values apiece.
Memory Usage: The dataset occupies roughly 6.3 MB in memory.
Missing Data: The missing data in V1 and V2 needs to be handled before feeding the data into a machine learning model. Common strategies include imputing missing values (e.g., using the mean, median, or mode) or removing rows with missing data.
Target Variable: The Target variable is an integer, likely representing binary classes (0 for no failure and 1 for failure in the context of the ReneWind project). Understanding the class distribution is crucial, especially if the classes are imbalanced.
Display the first few rows of the testing dataset
# Display the first few rows of the testing dataset
print("\nFirst few rows of Testing Dataset:")
display(test_df.head())
First few rows of Testing Dataset:
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.613489 | -3.819640 | 2.202302 | 1.300420 | -1.184929 | -4.495964 | -1.835817 | 4.722989 | 1.206140 | -0.341909 | ... | 2.291204 | -5.411388 | 0.870073 | 0.574479 | 4.157191 | 1.428093 | -10.511342 | 0.454664 | -1.448363 | 0 |
| 1 | 0.389608 | -0.512341 | 0.527053 | -2.576776 | -1.016766 | 2.235112 | -0.441301 | -4.405744 | -0.332869 | 1.966794 | ... | -2.474936 | 2.493582 | 0.315165 | 2.059288 | 0.683859 | -0.485452 | 5.128350 | 1.720744 | -1.488235 | 0 |
| 2 | -0.874861 | -0.640632 | 4.084202 | -1.590454 | 0.525855 | -1.957592 | -0.695367 | 1.347309 | -1.732348 | 0.466500 | ... | -1.318888 | -2.997464 | 0.459664 | 0.619774 | 5.631504 | 1.323512 | -1.752154 | 1.808302 | 1.675748 | 0 |
| 3 | 0.238384 | 1.458607 | 4.014528 | 2.534478 | 1.196987 | -3.117330 | -0.924035 | 0.269493 | 1.322436 | 0.702345 | ... | 3.517918 | -3.074085 | -0.284220 | 0.954576 | 3.029331 | -1.367198 | -3.412140 | 0.906000 | -2.450889 | 0 |
| 4 | 5.828225 | 2.768260 | -1.234530 | 2.809264 | -1.641648 | -1.406698 | 0.568643 | 0.965043 | 1.918379 | -2.774855 | ... | 1.773841 | -1.501573 | -2.226702 | 4.776830 | -6.559698 | -0.805551 | -0.276007 | -3.858207 | -0.537694 | 0 |
5 rows × 41 columns
Display basic information of the Testing Dataset
print("\nTesting Dataset Info:")
print(test_df.info())
Testing Dataset Info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 41 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 V1 4995 non-null float64 1 V2 4994 non-null float64 2 V3 5000 non-null float64 3 V4 5000 non-null float64 4 V5 5000 non-null float64 5 V6 5000 non-null float64 6 V7 5000 non-null float64 7 V8 5000 non-null float64 8 V9 5000 non-null float64 9 V10 5000 non-null float64 10 V11 5000 non-null float64 11 V12 5000 non-null float64 12 V13 5000 non-null float64 13 V14 5000 non-null float64 14 V15 5000 non-null float64 15 V16 5000 non-null float64 16 V17 5000 non-null float64 17 V18 5000 non-null float64 18 V19 5000 non-null float64 19 V20 5000 non-null float64 20 V21 5000 non-null float64 21 V22 5000 non-null float64 22 V23 5000 non-null float64 23 V24 5000 non-null float64 24 V25 5000 non-null float64 25 V26 5000 non-null float64 26 V27 5000 non-null float64 27 V28 5000 non-null float64 28 V29 5000 non-null float64 29 V30 5000 non-null float64 30 V31 5000 non-null float64 31 V32 5000 non-null float64 32 V33 5000 non-null float64 33 V34 5000 non-null float64 34 V35 5000 non-null float64 35 V36 5000 non-null float64 36 V37 5000 non-null float64 37 V38 5000 non-null float64 38 V39 5000 non-null float64 39 V40 5000 non-null float64 40 Target 5000 non-null int64 dtypes: float64(40), int64(1) memory usage: 1.6 MB None
Number of Entries: The testing dataset contains 5,000 rows (RangeIndex 0 to 4999).
Number of Columns: There are 41 columns in total: 40 predictor variables (V1 to V40) and 1 target variable (Target).
Data Types: All 40 predictors are of type float64, indicating they are numerical. The Target column is of type int64, representing the categorical target variable for classification.
Non-Null Values: V1 and V2 have slightly fewer non-null entries than the other columns: V1 has 4,995 non-null entries (5 missing values) and V2 has 4,994 (6 missing values).
Memory Usage: The dataset occupies roughly 1.6 MB in memory.
Missing Data: The missing data in V1 and V2 needs to be addressed, just as in the training dataset. Consistency in handling missing data across both datasets is crucial for ensuring that the model performs well on unseen data.
Target Variable: The Target variable is present in the testing set, which is essential for evaluating the model's performance by comparing the predicted values to the actual target values.
The presence of missing values in V1 and V2 in both the training and testing datasets highlights the need for a robust preprocessing pipeline that handles missing values consistently.
# Check for missing values
missing_values = train_df.isnull().sum()
print("Missing Values in Training Dataset:")
print(missing_values[missing_values > 0])
missing_values_test = test_df.isnull().sum()
print("\nMissing Values in Testing Dataset:")
print(missing_values_test[missing_values_test > 0])
Missing Values in Training Dataset: V1 18 V2 18 dtype: int64 Missing Values in Testing Dataset: V1 5 V2 6 dtype: int64
Observation:
Missing Values Analysis:
Training Dataset: V1 and V2 each have 18 missing values; all other columns are complete.
Testing Dataset: V1 has 5 missing values and V2 has 6; all other columns are complete.
# Check for duplicate rows
duplicates_train = train_df.duplicated().sum()
print(f"Number of duplicate rows in Training Dataset: {duplicates_train}")
duplicates_test = test_df.duplicated().sum()
print(f"Number of duplicate rows in Testing Dataset: {duplicates_test}")
Number of duplicate rows in Training Dataset: 0 Number of duplicate rows in Testing Dataset: 0
Observation:
Both datasets were checked for duplicate entries; neither the training set (0 duplicates) nor the testing set (0 duplicates) contains any duplicate rows, so no deduplication is required.
# Check data types of each column
print("Data Types in Training Dataset:")
print(train_df.dtypes)
print("\nData Types in Testing Dataset:")
print(test_df.dtypes)
Data Types in Training Dataset: V1 float64 V2 float64 V3 float64 V4 float64 V5 float64 V6 float64 V7 float64 V8 float64 V9 float64 V10 float64 V11 float64 V12 float64 V13 float64 V14 float64 V15 float64 V16 float64 V17 float64 V18 float64 V19 float64 V20 float64 V21 float64 V22 float64 V23 float64 V24 float64 V25 float64 V26 float64 V27 float64 V28 float64 V29 float64 V30 float64 V31 float64 V32 float64 V33 float64 V34 float64 V35 float64 V36 float64 V37 float64 V38 float64 V39 float64 V40 float64 Target int64 dtype: object Data Types in Testing Dataset: V1 float64 V2 float64 V3 float64 V4 float64 V5 float64 V6 float64 V7 float64 V8 float64 V9 float64 V10 float64 V11 float64 V12 float64 V13 float64 V14 float64 V15 float64 V16 float64 V17 float64 V18 float64 V19 float64 V20 float64 V21 float64 V22 float64 V23 float64 V24 float64 V25 float64 V26 float64 V27 float64 V28 float64 V29 float64 V30 float64 V31 float64 V32 float64 V33 float64 V34 float64 V35 float64 V36 float64 V37 float64 V38 float64 V39 float64 V40 float64 Target int64 dtype: object
Observation: Data Types in Training and Testing Datasets
Training Dataset: All 40 predictors (V1 to V40) are of type float64, and the target (Target) is of type int64.
Testing Dataset: All predictors are likewise float64, and the target (Target) is also int64.
Interpretation:
Consistency Across Datasets: The data types are consistent across both the training and testing datasets, which is crucial for ensuring that model training and evaluation proceed smoothly and without type errors.
Appropriate Data Types: The float64 type for predictor variables is appropriate for continuous numerical data, which is common for sensor data like this, and the int64 type for the target is appropriate since it represents a binary classification (0 or 1).
Action: No type conversions are required before modeling.
# Check distribution of the target variable
if 'Target' in train_df.columns:
    print("Target Variable Distribution in Training Dataset:")
    print(train_df['Target'].value_counts(normalize=True))
Target Variable Distribution in Training Dataset: Target 0 0.9445 1 0.0555 Name: proportion, dtype: float64
Observation: Target Variable Distribution in Training Dataset
Roughly 94.45% of the observations are labeled 0 (no failure) and only 5.55% are labeled 1 (failure).
Interpretation: The target classes are heavily imbalanced. A naive model could reach about 94% accuracy by always predicting "no failure" while never detecting an actual failure.
Action: Apply class-balancing techniques during training, such as oversampling (e.g., SMOTE), undersampling, or class weights.
Balancing the classes will help in building a model that can more accurately predict failures, which is critical for the predictive maintenance objective.
# Generate summary statistics
print("Statistical Summary of Training Dataset:")
display(train_df.describe())
print("\nStatistical Summary of Testing Dataset:")
display(test_df.describe())
Statistical Summary of Training Dataset:
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 19982.000000 | 19982.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | ... | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 |
| mean | -0.271996 | 0.440430 | 2.484699 | -0.083152 | -0.053752 | -0.995443 | -0.879325 | -0.548195 | -0.016808 | -0.012998 | ... | 0.303799 | 0.049825 | -0.462702 | 2.229620 | 1.514809 | 0.011316 | -0.344025 | 0.890653 | -0.875630 | 0.055500 |
| std | 3.441625 | 3.150784 | 3.388963 | 3.431595 | 2.104801 | 2.040970 | 1.761626 | 3.295756 | 2.160568 | 2.193201 | ... | 5.500400 | 3.575285 | 3.183841 | 2.937102 | 3.800860 | 1.788165 | 3.948147 | 1.753054 | 3.012155 | 0.228959 |
| min | -11.876451 | -12.319951 | -10.708139 | -15.082052 | -8.603361 | -10.227147 | -7.949681 | -15.657561 | -8.596313 | -9.853957 | ... | -19.876502 | -16.898353 | -17.985094 | -15.349803 | -14.833178 | -5.478350 | -17.375002 | -6.438880 | -11.023935 | 0.000000 |
| 25% | -2.737146 | -1.640674 | 0.206860 | -2.347660 | -1.535607 | -2.347238 | -2.030926 | -2.642665 | -1.494973 | -1.411212 | ... | -3.420469 | -2.242857 | -2.136984 | 0.336191 | -0.943809 | -1.255819 | -2.987638 | -0.272250 | -2.940193 | 0.000000 |
| 50% | -0.747917 | 0.471536 | 2.255786 | -0.135241 | -0.101952 | -1.000515 | -0.917179 | -0.389085 | -0.067597 | 0.100973 | ... | 0.052073 | -0.066249 | -0.255008 | 2.098633 | 1.566526 | -0.128435 | -0.316849 | 0.919261 | -0.920806 | 0.000000 |
| 75% | 1.840112 | 2.543967 | 4.566165 | 2.130615 | 1.340480 | 0.380330 | 0.223695 | 1.722965 | 1.409203 | 1.477045 | ... | 3.761722 | 2.255134 | 1.436935 | 4.064358 | 3.983939 | 1.175533 | 2.279399 | 2.057540 | 1.119897 | 0.000000 |
| max | 15.493002 | 13.089269 | 17.090919 | 13.236381 | 8.133797 | 6.975847 | 8.006091 | 11.679495 | 8.137580 | 8.108472 | ... | 23.633187 | 16.692486 | 14.358213 | 15.291065 | 19.329576 | 7.467006 | 15.289923 | 7.759877 | 10.654265 | 1.000000 |
8 rows × 41 columns
Statistical Summary of Testing Dataset:
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4995.000000 | 4994.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | ... | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 |
| mean | -0.277622 | 0.397928 | 2.551787 | -0.048943 | -0.080120 | -1.042138 | -0.907922 | -0.574592 | 0.030121 | 0.018524 | ... | 0.232567 | -0.080115 | -0.392663 | 2.211205 | 1.594845 | 0.022931 | -0.405659 | 0.938800 | -0.932406 | 0.056400 |
| std | 3.466280 | 3.139562 | 3.326607 | 3.413937 | 2.110870 | 2.005444 | 1.769017 | 3.331911 | 2.174139 | 2.145437 | ... | 5.585628 | 3.538624 | 3.166101 | 2.948426 | 3.774970 | 1.785320 | 3.968936 | 1.716502 | 2.978193 | 0.230716 |
| min | -12.381696 | -10.716179 | -9.237940 | -14.682446 | -7.711569 | -8.924196 | -8.124230 | -12.252731 | -6.785495 | -8.170956 | ... | -17.244168 | -14.903781 | -14.699725 | -12.260591 | -12.735567 | -5.079070 | -15.334533 | -5.451050 | -10.076234 | 0.000000 |
| 25% | -2.743691 | -1.649211 | 0.314931 | -2.292694 | -1.615238 | -2.368853 | -2.054259 | -2.642088 | -1.455712 | -1.353320 | ... | -3.556267 | -2.348121 | -2.009604 | 0.321818 | -0.866066 | -1.240526 | -2.984480 | -0.208024 | -2.986587 | 0.000000 |
| 50% | -0.764767 | 0.427369 | 2.260428 | -0.145753 | -0.131890 | -1.048571 | -0.939695 | -0.357943 | -0.079891 | 0.166292 | ... | -0.076694 | -0.159713 | -0.171745 | 2.111750 | 1.702964 | -0.110415 | -0.381162 | 0.959152 | -1.002764 | 0.000000 |
| 75% | 1.831313 | 2.444486 | 4.587000 | 2.166468 | 1.341197 | 0.307555 | 0.212228 | 1.712896 | 1.449548 | 1.511248 | ... | 3.751857 | 2.099160 | 1.465402 | 4.031639 | 4.104409 | 1.237522 | 2.287998 | 2.130769 | 1.079738 | 0.000000 |
| max | 13.504352 | 14.079073 | 15.314503 | 12.140157 | 7.672835 | 5.067685 | 7.616182 | 10.414722 | 8.850720 | 6.598728 | ... | 26.539391 | 13.323517 | 12.146302 | 13.489237 | 17.116122 | 6.809938 | 13.064950 | 7.182237 | 8.698460 | 1.000000 |
8 rows × 41 columns
Observations: Statistical Summary of the Training and Testing Datasets
Mean and Standard Deviation:
Observation: The consistency in the mean and standard deviation between the training and testing datasets is a good sign, indicating that the data is similarly distributed.
Range of Values (Min, Max):
Observation: The presence of a wide range of values could be an indicator of outliers or just natural variability in the data. It's important to assess whether these extremes are valid observations or noise.
Quartile Distribution (25%, 50%, 75%):
Observation: Features with very narrow IQRs compared to their overall range may be concentrated around a specific set of values, which could indicate potential for feature scaling.
Target Variable: The target takes the values 0 (No Failure) and 1 (Failure).
Observation: The target variable has a low mean value (about 0.056) in both datasets, indicating that the data is imbalanced with a majority of observations being "No Failure." This imbalance needs to be considered during model development, potentially by using techniques such as oversampling, undersampling, or adjusting class weights.
General Conclusion: The training and testing datasets are statistically similar, which supports training on one and evaluating on the other. The main items to address before modeling are the missing values in V1 and V2, potential outliers, and the class imbalance.
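The "similarly distributed" claim can be spot-checked numerically by comparing per-feature means, scaled by the training standard deviation. A minimal sketch follows; `train_demo`/`test_demo` are toy stand-ins for the real frames, not the actual data:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train_df/test_df drawn from the same distribution
rng = np.random.default_rng(1)
cols = [f"V{i}" for i in range(1, 6)]
train_demo = pd.DataFrame(rng.normal(0, 3, size=(2000, 5)), columns=cols)
test_demo = pd.DataFrame(rng.normal(0, 3, size=(500, 5)), columns=cols)

# Absolute difference in means, scaled by the training std, per feature;
# small values (well below ~0.1) support "similarly distributed"
mean_gap = (train_demo.mean() - test_demo.mean()).abs() / train_demo.std()
print(mean_gap.round(3))
```

Run on the real `train_df`/`test_df` (dropping `Target` first), large gaps would flag features whose distributions shift between the two sets.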
# Check for columns with constant values
constant_columns_train = [col for col in train_df.columns if train_df[col].nunique() == 1]
print(f"Columns with constant values in Training Dataset: {constant_columns_train}")
constant_columns_test = [col for col in test_df.columns if test_df[col].nunique() == 1]
print(f"Columns with constant values in Testing Dataset: {constant_columns_test}")
Columns with constant values in Training Dataset: [] Columns with constant values in Testing Dataset: []
Observation: Constant Columns Check
Both checks returned empty lists, meaning neither dataset contains a column with a single constant value.
Interpretation: Every predictor carries at least some variation, so no feature is trivially uninformative.
Action: No columns need to be dropped on this basis.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the datasets into pandas DataFrames
train_df = pd.read_csv('Train.csv.csv')
test_df = pd.read_csv('Test.csv.csv')
# Function to plot a boxplot and a histogram along the same scale
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12, 7))
    kde: whether to show the density curve (default False)
    bins: number of bins for the histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # Creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # Boxplot with mean indicated
    if bins:  # Histogram, with density curve if specified
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
    plt.show()
# List of columns to exclude from plotting
exclude_columns = ['Target'] # Add any other columns you don't want to plot
# Plotting histograms and boxplots for all columns in the training dataset
for column in train_df.columns:
    if column not in exclude_columns:
        print(f'Plotting for {column}')
        histogram_boxplot(train_df, feature=column, kde=True)
# Plotting histograms and boxplots for all columns in the testing dataset
for column in test_df.columns:
    if column not in exclude_columns:
        print(f'Plotting for {column}')
        histogram_boxplot(test_df, feature=column, kde=True)
Plotting for V1
Plotting for V2
Plotting for V3
Plotting for V4
Plotting for V5
Plotting for V6
Plotting for V7
Plotting for V8
Plotting for V9
Plotting for V10
Plotting for V11
Plotting for V12
Plotting for V13
Plotting for V14
Plotting for V15
Plotting for V16
Plotting for V17
Plotting for V18
Plotting for V19
Plotting for V20
Plotting for V21
Plotting for V22
Plotting for V23
Plotting for V24
Plotting for V25
Plotting for V26
Plotting for V27
Plotting for V28
Plotting for V29
Plotting for V30
Plotting for V31
Plotting for V32
Plotting for V33
Plotting for V34
Plotting for V35
Plotting for V36
Plotting for V37
Plotting for V38
Plotting for V39
Plotting for V40
Plotting for V1
Plotting for V2
Plotting for V3
Plotting for V4
Plotting for V5
Plotting for V6
Plotting for V7
Plotting for V8
Plotting for V9
Plotting for V10
Plotting for V11
Plotting for V12
Plotting for V13
Plotting for V14
Plotting for V15
Plotting for V16
Plotting for V17
Plotting for V18
Plotting for V19
Plotting for V20
Plotting for V21
Plotting for V22
Plotting for V23
Plotting for V24
Plotting for V25
Plotting for V26
Plotting for V27
Plotting for V28
Plotting for V29
Plotting for V30
Plotting for V31
Plotting for V32
Plotting for V33
Plotting for V34
Plotting for V35
Plotting for V36
Plotting for V37
Plotting for V38
Plotting for V39
Plotting for V40
Observations:
Outliers: Multiple features show the presence of outliers, particularly on the higher end of the distribution. These outliers could significantly impact the mean and may require treatment (e.g., capping, transformation) before model training.
Skewness: Many features exhibit right-skewed distributions. Features with skewness may benefit from transformations like log or square root to achieve a more normal distribution, which could improve model performance.
Symmetry: Some features are symmetrically distributed, indicating they may already be normally distributed. These features might not need significant transformations.
Central Tendency: The mean and median for each feature were plotted on the histograms. For symmetrical distributions, the mean and median are close together, while for skewed distributions, there is a noticeable gap, with the mean typically being higher in right-skewed distributions.
Specific Feature Observations:
V1, V2, V5, V7, V8, V11, V13, V15, V17, V19, V21, V23, V25, V27, V29, V31, V33, V35, V37, V39:
V3, V4, V6, V9, V10, V12, V14, V16, V18, V20, V22, V24, V26, V28, V30, V32, V34, V36, V38, V40:
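For the right-skewed features noted above, a plain log transform will not work here because the summary statistics show negative values. A Yeo-Johnson power transform handles any sign; below is a minimal sketch on a synthetic skewed feature (a stand-in for the real columns):

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Toy right-skewed feature; shifted so that, like the ciphered sensor
# values, it includes negatives (which rules out a plain log transform)
rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=2.0, size=5000) - 1.0

pt = PowerTransformer(method="yeo-johnson")
x_t = pt.fit_transform(x.reshape(-1, 1)).ravel()

# Skewness should shrink toward 0 after the transform
print(round(skew(x), 2), round(skew(x_t), 2))
```

Like the imputer, such a transformer should be fitted on the training data only and then applied unchanged to the test set.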
Load Data
from google.colab import files
uploaded = files.upload()
train_df = pd.read_csv('Train.csv.csv')
test_df = pd.read_csv('Test.csv.csv')
Saving Test.csv.csv to Test.csv.csv Saving Train.csv.csv to Train.csv.csv
Define Target and Features
# Define the target column
target_column = 'Target'
# Separate features and target
X = train_df.drop(columns=[target_column])
y = train_df[target_column]
# Check for any missing values
print(X.isnull().sum())
print(y.isnull().sum())
V1 18 V2 18 V3 0 V4 0 V5 0 V6 0 V7 0 V8 0 V9 0 V10 0 V11 0 V12 0 V13 0 V14 0 V15 0 V16 0 V17 0 V18 0 V19 0 V20 0 V21 0 V22 0 V23 0 V24 0 V25 0 V26 0 V27 0 V28 0 V29 0 V30 0 V31 0 V32 0 V33 0 V34 0 V35 0 V36 0 V37 0 V38 0 V39 0 V40 0 dtype: int64 0
Observation
Among the features, only V1 and V2 contain missing values (18 each), and the target has none. Median imputation is a reasonable choice here because the median is robust to the outliers observed in the feature distributions.
from sklearn.impute import SimpleImputer
# Impute missing values in features
imputer = SimpleImputer(strategy='median')
X = imputer.fit_transform(X)
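For consistency with the test set (as noted earlier), the imputer fitted on the training medians should be reused to transform the test features rather than refitted. A minimal sketch with toy frames (`train_demo`/`test_demo` are illustrative, with the same missing-value pattern as V1/V2):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Small stand-in frames with missing values in both columns
train_demo = pd.DataFrame({"V1": [1.0, np.nan, 3.0], "V2": [4.0, 5.0, np.nan]})
test_demo = pd.DataFrame({"V1": [np.nan, 2.0], "V2": [np.nan, 7.0]})

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train_demo)  # fit on TRAINING medians
test_filled = imputer.transform(test_demo)        # reuse them on the test set

print(test_filled[0])  # filled with the training medians: [2.0, 4.5]
```

Calling `transform` (not `fit_transform`) on the test set prevents test statistics from leaking into preprocessing.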
Observation:
The missing values in V1 and V2 have been filled with the respective column medians, so the feature matrix is now complete.
Which metric to optimize?
Since the cost of replacing a failed generator far exceeds the cost of repair, and repair exceeds inspection, false negatives (missed failures) are the most expensive errors, while false positives (unnecessary inspections) are comparatively cheap. Recall on the failure class is therefore the metric to maximize.
Let's define a function to output different metrics (including recall) on the train and test sets, and a function to show the confusion matrix, so that we do not have to repeat the same code while evaluating models.
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix, classification_report
# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)
# Function to compute different metrics to check the classification model performance
def model_performance_classification_sklearn(model, predictors, target):
    pred = model.predict(predictors)
    acc = accuracy_score(target, pred)
    recall = recall_score(target, pred)
    precision = precision_score(target, pred)
    f1 = f1_score(target, pred)
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
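A companion helper for the confusion matrix mentioned above could look like the sketch below; the function name and label text are choices made here, not fixed by the notebook:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

def confusion_matrix_sklearn(model, predictors, target):
    """Plot the confusion matrix for a fitted model and return the counts."""
    pred = model.predict(predictors)
    cm = confusion_matrix(target, pred)
    plt.figure(figsize=(5, 4))
    sns.heatmap(
        cm, annot=True, fmt="d", cmap="Blues",
        xticklabels=["No failure", "Failure"],
        yticklabels=["No failure", "Failure"],
    )
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.show()
    return cm
```

Usage would mirror the metrics helper, e.g. `confusion_matrix_sklearn(dtree, X_val, y_val)` after fitting.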
# Define the Decision Tree model
dtree = DecisionTreeClassifier(random_state=1)
# Train the model
dtree.fit(X_train, y_train)
# Evaluate the model on the validation set
val_performance = model_performance_classification_sklearn(dtree, X_val, y_val)
print(val_performance)
Accuracy Recall Precision F1 0 0.97125 0.738739 0.742081 0.740406
Observations:
Accuracy (0.97125): The model has a high accuracy of approximately 97.13%, meaning it correctly classifies around 97% of the validation samples. This indicates that the model performs well overall, but accuracy alone might not fully reflect performance, especially in cases of imbalanced datasets.
Recall (0.7387): The recall score of approximately 73.87% indicates the model's ability to correctly identify positive instances (e.g., failures) from all actual positive instances. While a decent recall score suggests the model is reasonably good at detecting failures, there is still room for improvement.
Precision (0.7421): The precision score of approximately 74.21% shows the proportion of positive identifications that were actually correct. This indicates that when the model predicts a failure, it is correct about 74% of the time.
F1 Score (0.7404): The F1 score, which is the harmonic mean of precision and recall, is approximately 74.04%. This metric balances the trade-off between precision and recall and reflects overall performance in terms of both false positives and false negatives.
# Type of scoring used to compare parameter combinations
# Defining the scorer to be used for cross-validation and hyperparameter tuning
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import make_scorer, recall_score

# Maximize recall, which is important for reducing false negatives
scorer = make_scorer(recall_score)
# Initialize a list to hold models
models = []
# Append the Decision Tree model to the list
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
# Evaluate the models using cross-validation
results1 = []
names = []
print("\nCross-Validation performance on training dataset:\n")
for name, model in models:
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
cv_result = cross_val_score(estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold)
results1.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean()))
# Evaluate the models on the validation set
print("\nValidation Performance:\n")
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

dtree: 0.7196280073636767

Validation Performance:

dtree: 0.7387387387387387
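StratifiedKFold preserves the class ratio in every fold, which matters for a rare-failure target; a minimal sketch on hypothetical labels with a 10% positive rate:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 90 + [1] * 10)  # hypothetical labels, 10% positives
X = np.zeros((100, 1))             # features are irrelevant to the split itself
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for _, val_idx in skf.split(X, y):
    # every validation fold holds exactly 18 negatives and 2 positives
    print(np.bincount(y[val_idx]))
```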
Observation
The recall score obtained from cross-validation indicates that the Decision Tree model has an average recall of about 71.96% when evaluated on different subsets of the training data. This suggests that, on average, the model is able to identify approximately 72% of the actual failures during training. This is a good indication that the model can detect a significant portion of failures, which is important for predictive maintenance.
The recall score on the validation set is slightly higher at around 73.87%. This shows that the model's ability to detect failures is consistent when applied to unseen data. A slightly higher recall on the validation set compared to cross-validation suggests that the model generalizes well to new data, which is crucial for real-world application.
Performance Consistency: The model's performance on both cross-validation and the validation set is quite consistent, indicating that it performs reliably across different data subsets.
Recall Focus: Given that recall is the key metric for this problem (identifying failures before they occur), a recall of around 74% suggests the model is reasonably effective at detecting failures, limiting the false negatives that would otherwise incur replacement costs.
Model Evaluation: While the recall is reasonably high, it's essential to also consider other metrics like precision, F1 score, and the overall cost implications of false positives and false negatives in the context of maintenance costs.
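The cost trade-off in the last point can be made concrete. With hypothetical unit costs that respect the given ordering (inspection < repair < replacement), the total maintenance cost can be read straight off the confusion matrix; all numbers below are illustrative, not taken from the model above:

```python
# Hypothetical unit costs obeying inspection < repair < replacement
COST_INSPECT, COST_REPAIR, COST_REPLACE = 5, 15, 40

def maintenance_cost(tp, fp, fn):
    """TP -> repaired before failing; FP -> needless inspection; FN -> missed failure, full replacement."""
    return tp * COST_REPAIR + fp * COST_INSPECT + fn * COST_REPLACE

# Illustrative counts: trading precision for recall cuts the dominant FN cost
print(maintenance_cost(tp=164, fp=57, fn=58))   # 164*15 + 57*5 + 58*40 = 5065
print(maintenance_cost(tp=200, fp=120, fn=22))  # 200*15 + 120*5 + 22*40 = 4480
```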
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
Hyperparameter tuning can take a long time to run, so to keep runtimes manageable the following grids can be used wherever required:
# Grid for Gradient Boosting
param_grid = {
    "n_estimators": np.arange(100, 150, 25),
    "learning_rate": [0.2, 0.05, 1],
    "subsample": [0.5, 0.7],
    "max_features": [0.5, 0.7],
}
# Grid for AdaBoost ('base_estimator' was renamed to 'estimator' in scikit-learn 1.2)
param_grid = {
    "n_estimators": [100, 150, 200],
    "learning_rate": [0.2, 0.05],
    "estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
# Grid for Bagging
param_grid = {
    "max_samples": [0.8, 0.9, 1],
    "max_features": [0.7, 0.8, 0.9],
    "n_estimators": [30, 50, 70],
}
# Grid for Random Forest
param_grid = {
    "n_estimators": [200, 250, 300],
    "min_samples_leaf": np.arange(1, 4),
    "max_features": list(np.arange(0.3, 0.6, 0.1)) + ["sqrt"],  # each fraction is a single candidate value
    "max_samples": np.arange(0.4, 0.7, 0.1),
}
# Grid for the Decision Tree
param_grid = {
    "max_depth": np.arange(2, 6),
    "min_samples_leaf": [1, 4, 7],
    "max_leaf_nodes": [10, 15],
    "min_impurity_decrease": [0.0001, 0.001],
}
# Grid for Logistic Regression
param_grid = {"C": np.arange(0.1, 1.1, 0.1)}
# Grid for XGBoost
param_grid = {
    "n_estimators": [150, 200, 250],
    "scale_pos_weight": [5, 10],
    "learning_rate": [0.1, 0.2],
    "gamma": [0, 3, 5],
    "subsample": [0.8, 0.9],
}
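For the XGBoost grid, a common starting point for `scale_pos_weight` is the negative-to-positive class ratio; a quick sketch (the labels below are hypothetical stand-ins for `y_train`):

```python
import numpy as np

# Hypothetical labels with a ~5% failure rate
y = np.array([0] * 950 + [1] * 50)
neg, pos = np.bincount(y)
print(neg / pos)  # -> 19.0; values near this ratio are worth including in the grid
```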
# defining model
Model = DecisionTreeClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,6),
'min_samples_leaf': [1, 4, 7],
'max_leaf_nodes' : [10,15],
'min_impurity_decrease': [0.0001,0.001] }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}".format(randomized_cv.best_params_, randomized_cv.best_score_))
Best parameters are {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 15, 'max_depth': 5} with CV score=0.5675998222560782
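By default `RandomizedSearchCV` refits the best configuration on the full training set (`refit=True`), so `best_estimator_` can be evaluated directly; a self-contained sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, weights=[0.9], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_distributions={"max_depth": np.arange(2, 6)},
    n_iter=4, scoring="recall", cv=3, random_state=1,
)
search.fit(X_tr, y_tr)  # refit=True by default: best_estimator_ is ready to use
print(recall_score(y_te, search.best_estimator_.predict(X_te)))
```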
Observations:

min_samples_leaf = 7: each leaf must contain at least 7 samples, which smooths predictions and curbs overfitting.
min_impurity_decrease = 0.0001: a split is kept only if it reduces impurity by at least this amount, pruning marginal splits.
max_leaf_nodes = 15: the tree is capped at 15 leaves, keeping the model compact.
max_depth = 5: the tree is limited to 5 levels, further restraining complexity.
CV Score of 0.5676: a mean cross-validated recall of roughly 57% on the original imbalanced training data, which leaves considerable room for improvement.
# defining model
Model = DecisionTreeClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,6),
'min_samples_leaf': [1, 4, 7],
'max_leaf_nodes' : [10,15],
'min_impurity_decrease': [0.0001,0.001] }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}".format(randomized_cv.best_params_, randomized_cv.best_score_))
Best parameters are {'min_samples_leaf': 7, 'min_impurity_decrease': 0.001, 'max_leaf_nodes': 15, 'max_depth': 3} with CV score=0.9143060712783726
Observations:

min_samples_leaf = 7: leaves must hold at least 7 samples, discouraging overly specific splits.
min_impurity_decrease = 0.001: a stricter impurity threshold than on the original data, pruning weak splits more aggressively.
max_leaf_nodes = 15: the tree is again capped at 15 leaves.
max_depth = 3: a shallower tree than before, suggesting the balanced classes are easier to separate.
CV Score of 0.9143: a mean cross-validated recall of roughly 91% on the oversampled folds.
Tuning on the oversampled data yields a much higher cross-validation recall, and the selected parameters describe a compact, well-regularized tree. One caveat: because SMOTE was applied before the data was split into folds, synthetic minority samples appear in the CV validation folds, so this score is optimistic; recall on the untouched validation set remains the more reliable indicator.
# defining model
Model = DecisionTreeClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,20),
'min_samples_leaf': [1, 2, 5, 7],
'max_leaf_nodes' : [5, 10,15],
'min_impurity_decrease': [0.0001,0.001] }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}".format(randomized_cv.best_params_, randomized_cv.best_score_))
Best parameters are {'min_samples_leaf': 1, 'min_impurity_decrease': 0.001, 'max_leaf_nodes': 5, 'max_depth': 14} with CV score=0.8492287183393639
Observations:

min_samples_leaf = 1: a leaf may contain a single sample, permitting very specific splits and increasing variance.
min_impurity_decrease = 0.001: weak splits are still pruned by the impurity threshold.
max_leaf_nodes = 5: a hard cap of 5 leaves keeps the tree small despite the other permissive settings.
max_depth = 14: a large nominal depth, although with only 5 leaves allowed, the depth limit is rarely binding.
CV Score of 0.8492: a mean cross-validated recall of roughly 85% on the undersampled folds.
The selected parameters allow a flexible tree, with some overfitting risk given how little data remains after undersampling. The recall, while good, is lower than with the oversampled data, which suggests that discarding majority-class rows removed useful information.
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score, StratifiedKFold
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, make_scorer
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# Load datasets
train_df = pd.read_csv('Train.csv.csv')
test_df = pd.read_csv('Test.csv.csv')
# Define the target column
target_column = 'Target'
# Separate features and target
X = train_df.drop(columns=[target_column])
y = train_df[target_column]
# Impute missing values in features
imputer = SimpleImputer(strategy='median')
X = imputer.fit_transform(X)
# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)
# Function to compute different metrics to check the classification model performance
def model_performance_classification_sklearn(model, predictors, target):
pred = model.predict(predictors)
acc = accuracy_score(target, pred)
recall = recall_score(target, pred)
precision = precision_score(target, pred)
f1 = f1_score(target, pred)
return pd.DataFrame({"Accuracy": [acc], "Recall": [recall], "Precision": [precision], "F1": [f1]})
# Define a scorer to maximize recall
scorer = make_scorer(recall_score)
# Initialize models
models = [
("DecisionTree", DecisionTreeClassifier(random_state=1))
]
# Hyperparameter tuning for original data
param_grid = {
'max_depth': np.arange(2, 6),
'min_samples_leaf': [1, 4, 7],
'max_leaf_nodes': [10, 15],
'min_impurity_decrease': [0.0001, 0.001]
}
# RandomizedSearchCV for original data
randomized_cv = RandomizedSearchCV(estimator=DecisionTreeClassifier(random_state=1),
param_distributions=param_grid,
n_iter=10,
n_jobs=-1,
scoring=scorer,
cv=5,
random_state=1)
randomized_cv.fit(X_train, y_train)
print("Original Data Best Parameters: {} with CV score={}".format(randomized_cv.best_params_, randomized_cv.best_score_))
# Model tuning on oversampled data
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
randomized_cv.fit(X_train_over, y_train_over)
print("Oversampled Data Best Parameters: {} with CV score={}".format(randomized_cv.best_params_, randomized_cv.best_score_))
# Model tuning on undersampled data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
randomized_cv.fit(X_train_un, y_train_un)
print("Undersampled Data Best Parameters: {} with CV score={}".format(randomized_cv.best_params_, randomized_cv.best_score_))
# Train the final model with the best parameters from the preferred search run
best_params = randomized_cv.best_params_  # note: randomized_cv was last fit on the undersampled data
final_model = DecisionTreeClassifier(**best_params, random_state=1)
final_model.fit(X_train, y_train)
# Evaluate on the validation set
val_performance = model_performance_classification_sklearn(final_model, X_val, y_val)
print("Validation Performance:")
print(val_performance)
# Evaluate on the test set
X_test = test_df.drop(columns=[target_column])
y_test = test_df[target_column]
X_test = imputer.transform(X_test)
test_performance = model_performance_classification_sklearn(final_model, X_test, y_test)
print("Final Model Performance on Test Set:")
print(test_performance)
Original Data Best Parameters: {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 15, 'max_depth': 5} with CV score=0.5675998222560782
Oversampled Data Best Parameters: {'min_samples_leaf': 7, 'min_impurity_decrease': 0.001, 'max_leaf_nodes': 15, 'max_depth': 3} with CV score=0.9143060712783726
Undersampled Data Best Parameters: {'min_samples_leaf': 1, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 10, 'max_depth': 5} with CV score=0.8581857423982733
Validation Performance:
Accuracy Recall Precision F1
0 0.96675 0.536036 0.798658 0.641509
Final Model Performance on Test Set:
Accuracy Recall Precision F1
0 0.9638 0.489362 0.788571 0.603939
Here's a comparative analysis of the model performances across different datasets and the final model performance:

Original Data: {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 15, 'max_depth': 5}, CV recall 0.5676
Oversampled Data: {'min_samples_leaf': 7, 'min_impurity_decrease': 0.001, 'max_leaf_nodes': 15, 'max_depth': 3}, CV recall 0.9143
Undersampled Data: {'min_samples_leaf': 1, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 10, 'max_depth': 5}, CV recall 0.8582
Best Cross-Validation Score: the oversampled run scores highest (0.9143), though CV on oversampled folds is optimistic because synthetic samples land in the validation folds.
Validation and Test Set Performance: the final model reaches only 53.6% recall on validation and 48.9% on test, with precision near 0.79 in both cases, well below the resampled CV figures.
Choosing the Final Model: since the goal is catching failures, the final choice should be driven by recall on untouched held-out data rather than by resampled CV scores.
All models included:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
# Load your data
train_df = pd.read_csv('Train.csv.csv')
test_df = pd.read_csv('Test.csv.csv')
# Define features and target
X = train_df.drop(columns=['Target'])
y = train_df['Target']
X_test = test_df.drop(columns=['Target'])
y_test = test_df['Target']
# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
# Define the preprocessing pipeline for the training data
preprocessing_pipeline = Pipeline([
('imputer', SimpleImputer(strategy='mean')),
('scaler', StandardScaler()),
('smote', SMOTE(random_state=42)),
('under', RandomUnderSampler(random_state=42)),
])
# Apply the pipeline to the training data (fit and resample)
X_train_resampled, y_train_resampled = preprocessing_pipeline.fit_resample(X_train, y_train)
# Apply the same imputer and scaler (without SMOTE or undersampling) to validation and test data
impute_scale_pipeline = Pipeline([
('imputer', SimpleImputer(strategy='mean')),
('scaler', StandardScaler()),
])
# Fit the imputer/scaler on the training data only, then transform the validation and test data
# (calling fit_transform on X_val/X_test would leak their statistics into preprocessing)
impute_scale_pipeline.fit(X_train)
X_val_transformed = impute_scale_pipeline.transform(X_val)
X_test_transformed = impute_scale_pipeline.transform(X_test)
# Define the models with hyperparameter grids
models = {
'Gradient Boosting': {
'model': GradientBoostingClassifier(random_state=42),
'params': {
'n_estimators': [100, 200],
'learning_rate': [0.01, 0.1],
'max_depth': [3, 5]
}
},
'AdaBoost': {
'model': AdaBoostClassifier(random_state=42),
'params': {
'n_estimators': [50, 100],
'learning_rate': [0.01, 0.1],
'estimator': [DecisionTreeClassifier(max_depth=1), DecisionTreeClassifier(max_depth=3)]
}
},
'Bagging': {
'model': BaggingClassifier(random_state=42),
'params': {
'n_estimators': [10, 50],
'estimator': [DecisionTreeClassifier(max_depth=1), DecisionTreeClassifier(max_depth=3)]  # 'base_estimator' was renamed to 'estimator' in scikit-learn 1.2
}
},
'Random Forest': {
'model': RandomForestClassifier(random_state=42),
'params': {
'n_estimators': [100, 200],
'max_depth': [3, 5]
}
},
'Logistic Regression': {
'model': LogisticRegression(max_iter=1000, random_state=42),
'params': {
'C': [0.01, 0.1, 1, 10],
'solver': ['liblinear', 'saga']
}
},
'XGBoost': {
'model': XGBClassifier(random_state=42, eval_metric='logloss'),  # use_label_encoder is no longer needed in recent XGBoost versions
'params': {
'n_estimators': [100, 200],
'learning_rate': [0.01, 0.1],
'max_depth': [3, 5]
}
}
}
# Evaluate each model using GridSearchCV
for name, m in models.items():
print(f"Evaluating {name}...")
grid = GridSearchCV(m['model'], m['params'], cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train_resampled, y_train_resampled)
best_model = grid.best_estimator_
# Evaluate on validation set
y_val_pred = best_model.predict(X_val_transformed)
print(f"Best Parameters for {name}: {grid.best_params_}")
print(f"Validation Accuracy for {name}: {accuracy_score(y_val, y_val_pred)}")
print(f"Classification Report for {name}:\n{classification_report(y_val, y_val_pred)}")
print(f"Confusion Matrix for {name}:\n{confusion_matrix(y_val, y_val_pred)}")
# Evaluate on test set
y_test_pred = best_model.predict(X_test_transformed)
print(f"Test Accuracy for {name}: {accuracy_score(y_test, y_test_pred)}")
print(f"Test Classification Report for {name}:\n{classification_report(y_test, y_test_pred)}")
print(f"Test Confusion Matrix for {name}:\n{confusion_matrix(y_test, y_test_pred)}\n")
Evaluating Gradient Boosting...
Best Parameters for Gradient Boosting: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200}
Validation Accuracy for Gradient Boosting: 0.984
Classification Report for Gradient Boosting:
precision recall f1-score support
0 0.99 0.99 0.99 3775
1 0.83 0.89 0.86 225
accuracy 0.98 4000
macro avg 0.91 0.94 0.93 4000
weighted avg 0.98 0.98 0.98 4000
Confusion Matrix for Gradient Boosting:
[[3735 40]
[ 24 201]]
Test Accuracy for Gradient Boosting: 0.984
Test Classification Report for Gradient Boosting:
precision recall f1-score support
0 0.99 0.99 0.99 4718
1 0.86 0.85 0.86 282
accuracy 0.98 5000
macro avg 0.93 0.92 0.92 5000
weighted avg 0.98 0.98 0.98 5000
Test Confusion Matrix for Gradient Boosting:
[[4679 39]
[ 41 241]]
Evaluating AdaBoost...
Best Parameters for AdaBoost: {'estimator': DecisionTreeClassifier(max_depth=3), 'learning_rate': 0.1, 'n_estimators': 100}
Validation Accuracy for AdaBoost: 0.96725
Classification Report for AdaBoost:
precision recall f1-score support
0 0.99 0.97 0.98 3775
1 0.65 0.88 0.75 225
accuracy 0.97 4000
macro avg 0.82 0.93 0.87 4000
weighted avg 0.97 0.97 0.97 4000
Confusion Matrix for AdaBoost:
[[3670 105]
[ 26 199]]
Test Accuracy for AdaBoost: 0.9674
Test Classification Report for AdaBoost:
precision recall f1-score support
0 0.99 0.97 0.98 4718
1 0.66 0.85 0.75 282
accuracy 0.97 5000
macro avg 0.83 0.91 0.86 5000
weighted avg 0.97 0.97 0.97 5000
Test Confusion Matrix for AdaBoost:
[[4597 121]
[ 42 240]]
Evaluating Bagging...
Best Parameters for Bagging: {'base_estimator': DecisionTreeClassifier(max_depth=3), 'n_estimators': 50}
Validation Accuracy for Bagging: 0.77625
Classification Report for Bagging:
precision recall f1-score support
0 0.99 0.77 0.87 3775
1 0.18 0.87 0.30 225
accuracy 0.78 4000
macro avg 0.59 0.82 0.59 4000
weighted avg 0.94 0.78 0.84 4000
Confusion Matrix for Bagging:
[[2909 866]
[ 29 196]]
Test Accuracy for Bagging: 0.7838
Test Classification Report for Bagging:
precision recall f1-score support
0 0.99 0.78 0.87 4718
1 0.18 0.83 0.30 282
accuracy 0.78 5000
macro avg 0.59 0.80 0.59 5000
weighted avg 0.94 0.78 0.84 5000
Test Confusion Matrix for Bagging:
[[3686 1032]
[ 49 233]]
Evaluating Random Forest...
Best Parameters for Random Forest: {'max_depth': 5, 'n_estimators': 200}
Validation Accuracy for Random Forest: 0.9435
Classification Report for Random Forest:
precision recall f1-score support
0 0.99 0.95 0.97 3775
1 0.50 0.87 0.63 225
accuracy 0.94 4000
macro avg 0.75 0.91 0.80 4000
weighted avg 0.96 0.94 0.95 4000
Confusion Matrix for Random Forest:
[[3578 197]
[ 29 196]]
Test Accuracy for Random Forest: 0.9458
Test Classification Report for Random Forest:
precision recall f1-score support
0 0.99 0.95 0.97 4718
1 0.51 0.85 0.64 282
accuracy 0.95 5000
macro avg 0.75 0.90 0.80 5000
weighted avg 0.96 0.95 0.95 5000
Test Confusion Matrix for Random Forest:
[[4490 228]
[ 43 239]]
Evaluating Logistic Regression...
Best Parameters for Logistic Regression: {'C': 0.01, 'solver': 'saga'}
Validation Accuracy for Logistic Regression: 0.87225
Classification Report for Logistic Regression:
precision recall f1-score support
0 0.99 0.87 0.93 3775
1 0.29 0.88 0.44 225
accuracy 0.87 4000
macro avg 0.64 0.87 0.68 4000
weighted avg 0.95 0.87 0.90 4000
Confusion Matrix for Logistic Regression:
[[3292 483]
[ 28 197]]
Test Accuracy for Logistic Regression: 0.8702
Test Classification Report for Logistic Regression:
precision recall f1-score support
0 0.99 0.87 0.93 4718
1 0.28 0.80 0.41 282
accuracy 0.87 5000
macro avg 0.63 0.84 0.67 5000
weighted avg 0.95 0.87 0.90 5000
Test Confusion Matrix for Logistic Regression:
[[4124 594]
[ 55 227]]
Evaluating XGBoost...
Best Parameters for XGBoost: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200}
Validation Accuracy for XGBoost: 0.98425
Classification Report for XGBoost:
precision recall f1-score support
0 0.99 0.99 0.99 3775
1 0.84 0.89 0.86 225
accuracy 0.98 4000
macro avg 0.92 0.94 0.93 4000
weighted avg 0.98 0.98 0.98 4000
Confusion Matrix for XGBoost:
[[3737 38]
[ 25 200]]
Test Accuracy for XGBoost: 0.9836
Test Classification Report for XGBoost:
precision recall f1-score support
0 0.99 0.99 0.99 4718
1 0.86 0.85 0.85 282
accuracy 0.98 5000
macro avg 0.93 0.92 0.92 5000
weighted avg 0.98 0.98 0.98 5000
Test Confusion Matrix for XGBoost:
[[4679 39]
[ 43 239]]
Observations

Gradient Boosting: 98.4% accuracy on both validation and test sets, with failure-class recall of 0.89 (validation) and 0.85 (test) and precision around 0.83-0.86; the strongest overall performer.
AdaBoost: 96.7% accuracy with comparable recall (0.88 validation, 0.85 test) but noticeably lower precision (~0.65-0.66), i.e., more false alarms.
Bagging: only ~78% accuracy; failure-class precision collapses to 0.18, producing over a thousand false positives on the test set.
Random Forest: ~94.5% accuracy with good recall (0.87 validation, 0.85 test) but modest precision (~0.50-0.51) for the failure class.
Logistic Regression: ~87% accuracy; recall is decent (0.88 validation, 0.80 test) but failure-class precision is poor (~0.28-0.29).
XGBoost: 98.4% validation and 98.36% test accuracy, essentially matching Gradient Boosting on every metric.
Top Performers: Gradient Boosting and XGBoost have the highest accuracy on both validation and test sets, with well-balanced precision and recall for the positive class. Both models are strong candidates for final deployment.
Bagging: Shows much weaker performance, particularly in handling the class imbalance, and is not well suited to this use case.
Logistic Regression: While it has decent accuracy, its lower recall for the positive class compared to ensemble methods suggests it might not be the best choice for imbalanced data.
Random Forest: Performs well but slightly less so than Gradient Boosting and XGBoost in this case.
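Pulling the headline test-set numbers from the reports above into one table makes the ranking explicit (all values copied from the printed classification reports):

```python
import pandas as pd

# Test-set accuracy plus class-1 (failure) recall and precision, copied from the reports above
summary = pd.DataFrame(
    {
        "Model": ["Gradient Boosting", "XGBoost", "AdaBoost", "Random Forest",
                  "Logistic Regression", "Bagging"],
        "Test Accuracy": [0.9840, 0.9836, 0.9674, 0.9458, 0.8702, 0.7838],
        "Failure Recall": [0.85, 0.85, 0.85, 0.85, 0.80, 0.83],
        "Failure Precision": [0.86, 0.86, 0.66, 0.51, 0.28, 0.18],
    }
).sort_values("Test Accuracy", ascending=False)
print(summary.to_string(index=False))
```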
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
# Load your data
train_df = pd.read_csv('Train.csv.csv')
test_df = pd.read_csv('Test.csv.csv')
# Define features and target
X = train_df.drop(columns=['Target'])
y = train_df['Target']
X_test = test_df.drop(columns=['Target'])
y_test = test_df['Target']
# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
# Define the preprocessing pipeline for the training data
preprocessing_pipeline = Pipeline([
('imputer', SimpleImputer(strategy='mean')),
('scaler', StandardScaler()),
('smote', SMOTE(random_state=42)),
('under', RandomUnderSampler(random_state=42)),
])
# Apply the pipeline to the training data (fit and resample)
X_train_resampled, y_train_resampled = preprocessing_pipeline.fit_resample(X_train, y_train)
# Apply the same imputer and scaler (without SMOTE or undersampling) to validation and test data
impute_scale_pipeline = Pipeline([
('imputer', SimpleImputer(strategy='mean')),
('scaler', StandardScaler()),
])
# Fit the imputer/scaler on the training data only, then transform the validation and test data
impute_scale_pipeline.fit(X_train)
X_val_transformed = impute_scale_pipeline.transform(X_val)
X_test_transformed = impute_scale_pipeline.transform(X_test)
# Define the models with hyperparameter grids
models = {
'Gradient Boosting': {
'model': GradientBoostingClassifier(random_state=42),
'params': {
'n_estimators': [100, 200],
'learning_rate': [0.01, 0.1],
'max_depth': [3, 5]
}
},
'AdaBoost': {
'model': AdaBoostClassifier(random_state=42),
'params': {
'n_estimators': [50, 100],
'learning_rate': [0.01, 0.1],
'estimator': [DecisionTreeClassifier(max_depth=1), DecisionTreeClassifier(max_depth=3)]
}
},
'Bagging': {
'model': BaggingClassifier(random_state=42),
'params': {
'n_estimators': [10, 50],
'estimator': [DecisionTreeClassifier(max_depth=1), DecisionTreeClassifier(max_depth=3)]  # 'base_estimator' was renamed to 'estimator' in scikit-learn 1.2
}
},
'Random Forest': {
'model': RandomForestClassifier(random_state=42),
'params': {
'n_estimators': [100, 200],
'max_depth': [3, 5]
}
},
'Logistic Regression': {
'model': LogisticRegression(max_iter=1000, random_state=42),
'params': {
'C': [0.01, 0.1, 1, 10],
'solver': ['liblinear', 'saga']
}
},
'XGBoost': {
'model': XGBClassifier(random_state=42, eval_metric='logloss'),  # use_label_encoder is no longer needed in recent XGBoost versions
'params': {
'n_estimators': [100, 200],
'learning_rate': [0.01, 0.1],
'max_depth': [3, 5]
}
}
}
# Store the results in a dictionary
test_results = {}
# Evaluate each model on the test set and store the results
for name, m in models.items():
print(f"Evaluating {name} on the Test Set...")
grid = GridSearchCV(m['model'], m['params'], cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train_resampled, y_train_resampled)
best_model = grid.best_estimator_
# Predict on the test set
y_test_pred = best_model.predict(X_test_transformed)
# Store the results
test_results[name] = {
'Best Parameters': grid.best_params_,
'Test Accuracy': accuracy_score(y_test, y_test_pred),
'Test Classification Report': classification_report(y_test, y_test_pred, output_dict=True),
'Test Confusion Matrix': confusion_matrix(y_test, y_test_pred)
}
# Display the results
for name, result in test_results.items():
print(f"\n{name}")
print(f"Best Parameters: {result['Best Parameters']}")
print(f"Test Accuracy: {result['Test Accuracy']}")
print(f"Test Classification Report:")
print(pd.DataFrame(result['Test Classification Report']).transpose())
print(f"Test Confusion Matrix:\n{result['Test Confusion Matrix']}")
Evaluating Gradient Boosting on the Test Set...
Evaluating AdaBoost on the Test Set...
Evaluating Bagging on the Test Set...
Evaluating Random Forest on the Test Set...
Evaluating Logistic Regression on the Test Set...
Evaluating XGBoost on the Test Set...
Gradient Boosting
Best Parameters: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200}
Test Accuracy: 0.984
Test Classification Report:
precision recall f1-score support
0 0.991314 0.991734 0.991524 4718.000
1 0.860714 0.854610 0.857651 282.000
accuracy 0.984000 0.984000 0.984000 0.984
macro avg 0.926014 0.923172 0.924587 5000.000
weighted avg 0.983948 0.984000 0.983973 5000.000
Test Confusion Matrix:
[[4679 39]
[ 41 241]]
AdaBoost
Best Parameters: {'estimator': DecisionTreeClassifier(max_depth=3), 'learning_rate': 0.1, 'n_estimators': 100}
Test Accuracy: 0.9674
Test Classification Report:
precision recall f1-score support
0 0.990946 0.974354 0.982580 4718.0000
1 0.664820 0.851064 0.746501 282.0000
accuracy 0.967400 0.967400 0.967400 0.9674
macro avg 0.827883 0.912709 0.864540 5000.0000
weighted avg 0.972553 0.967400 0.969265 5000.0000
Test Confusion Matrix:
[[4597 121]
[ 42 240]]
Bagging
Best Parameters: {'base_estimator': DecisionTreeClassifier(max_depth=3), 'n_estimators': 50}
Test Accuracy: 0.7838
Test Classification Report:
precision recall f1-score support
0 0.986881 0.781263 0.872116 4718.0000
1 0.184190 0.826241 0.301228 282.0000
accuracy 0.783800 0.783800 0.783800 0.7838
macro avg 0.585535 0.803752 0.586672 5000.0000
weighted avg 0.941609 0.783800 0.839918 5000.0000
Test Confusion Matrix:
[[3686 1032]
[ 49 233]]
Random Forest
Best Parameters: {'max_depth': 5, 'n_estimators': 200}
Test Accuracy: 0.9458
Test Classification Report:
precision recall f1-score support
0 0.990514 0.951674 0.970706 4718.0000
1 0.511777 0.847518 0.638184 282.0000
accuracy 0.945800 0.945800 0.945800 0.9458
macro avg 0.751146 0.899596 0.804445 5000.0000
weighted avg 0.963513 0.945800 0.951952 5000.0000
Test Confusion Matrix:
[[4490 228]
[ 43 239]]
Logistic Regression
Best Parameters: {'C': 0.01, 'solver': 'saga'}
Test Accuracy: 0.8702
Test Classification Report:
precision recall f1-score support
0 0.986839 0.874099 0.927054 4718.0000
1 0.276492 0.804965 0.411605 282.0000
accuracy 0.870200 0.870200 0.870200 0.8702
macro avg 0.631666 0.839532 0.669329 5000.0000
weighted avg 0.946775 0.870200 0.897983 5000.0000
Test Confusion Matrix:
[[4124 594]
[ 55 227]]
XGBoost
Best Parameters: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200}
Test Accuracy: 0.9836
Test Classification Report:
precision recall f1-score support
0 0.990894 0.991734 0.991314 4718.0000
1 0.859712 0.847518 0.853571 282.0000
accuracy 0.983600 0.983600 0.983600 0.9836
macro avg 0.925303 0.919626 0.922442 5000.0000
weighted avg 0.983495 0.983600 0.983545 5000.0000
Test Confusion Matrix:
[[4679 39]
[ 43 239]]
Observations

Best parameters per model:
Gradient Boosting: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200}
AdaBoost: {'estimator': DecisionTreeClassifier(max_depth=3), 'learning_rate': 0.1, 'n_estimators': 100}
Bagging: {'base_estimator': DecisionTreeClassifier(max_depth=3), 'n_estimators': 50}
Random Forest: {'max_depth': 5, 'n_estimators': 200}
Logistic Regression: {'C': 0.01, 'solver': 'saga'}
XGBoost: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200}

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
# Load your data
train_df = pd.read_csv('Train.csv.csv')
test_df = pd.read_csv('Test.csv.csv')
# Define features and target
X = train_df.drop(columns=['Target'])
y = train_df['Target']
X_test = test_df.drop(columns=['Target'])
y_test = test_df['Target']
# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
# Define the models with hyperparameter grids
models = {
'Gradient Boosting': {
'model': GradientBoostingClassifier(random_state=42),
'params': {
'model__n_estimators': [100, 200],
'model__learning_rate': [0.01, 0.1],
'model__max_depth': [3, 5]
}
},
'AdaBoost': {
'model': AdaBoostClassifier(random_state=42),
'params': {
'model__n_estimators': [50, 100],
'model__learning_rate': [0.01, 0.1],
'model__estimator': [DecisionTreeClassifier(max_depth=1), DecisionTreeClassifier(max_depth=3)]
}
},
'Bagging': {
'model': BaggingClassifier(random_state=42),
'params': {
'model__n_estimators': [10, 50],
'model__estimator': [DecisionTreeClassifier(max_depth=1), DecisionTreeClassifier(max_depth=3)]  # 'base_estimator' was renamed to 'estimator' in scikit-learn 1.2
}
},
'Random Forest': {
'model': RandomForestClassifier(random_state=42),
'params': {
'model__n_estimators': [100, 200],
'model__max_depth': [3, 5]
}
},
'Logistic Regression': {
'model': LogisticRegression(max_iter=1000, random_state=42),
'params': {
'model__C': [0.01, 0.1, 1, 10],
'model__solver': ['liblinear', 'saga']
}
},
'XGBoost': {
'model': XGBClassifier(random_state=42, eval_metric='logloss'),  # use_label_encoder is no longer needed in recent XGBoost versions
'params': {
'model__n_estimators': [100, 200],
'model__learning_rate': [0.01, 0.1],
'model__max_depth': [3, 5]
}
}
}
# Store the results in a dictionary
test_results = {}
# Evaluate each model using a pipeline and GridSearchCV
for name, m in models.items():
    print(f"Evaluating {name} with Pipeline on the Test Set...")
    # Create a pipeline with preprocessing, resampling, and the model
    model_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
        ('smote', SMOTE(random_state=42)),
        ('under', RandomUnderSampler(random_state=42)),
        ('model', m['model'])
    ])
    # Grid search for the best hyperparameters
    grid = GridSearchCV(model_pipeline, m['params'], cv=5, scoring='accuracy', n_jobs=-1)
    grid.fit(X_train, y_train)
    # Get the best model from the grid search
    best_model = grid.best_estimator_
    # Predict on the test set
    y_test_pred = best_model.predict(X_test)
    # Store the results
    test_results[name] = {
        'Best Parameters': grid.best_params_,
        'Test Accuracy': accuracy_score(y_test, y_test_pred),
        'Test Classification Report': classification_report(y_test, y_test_pred, output_dict=True),
        'Test Confusion Matrix': confusion_matrix(y_test, y_test_pred)
    }
# Display the results
for name, result in test_results.items():
    print(f"\n{name}")
    print(f"Best Parameters: {result['Best Parameters']}")
    print(f"Test Accuracy: {result['Test Accuracy']}")
    print("Test Classification Report:")
    print(pd.DataFrame(result['Test Classification Report']).transpose())
    print(f"Test Confusion Matrix:\n{result['Test Confusion Matrix']}")
Evaluating Gradient Boosting with Pipeline on the Test Set...
Evaluating AdaBoost with Pipeline on the Test Set...
Evaluating Bagging with Pipeline on the Test Set...
/usr/local/lib/python3.10/dist-packages/sklearn/ensemble/_base.py:156: FutureWarning: `base_estimator` was renamed to `estimator` in version 1.2 and will be removed in 1.4.
  warnings.warn(
Evaluating Random Forest with Pipeline on the Test Set...
Evaluating Logistic Regression with Pipeline on the Test Set...
Evaluating XGBoost with Pipeline on the Test Set...
/usr/local/lib/python3.10/dist-packages/xgboost/core.py:158: UserWarning: [12:20:01] WARNING: /workspace/src/learner.cc:740:
Parameters: { "use_label_encoder" } are not used.
  warnings.warn(smsg, UserWarning)
Gradient Boosting
Best Parameters: {'model__learning_rate': 0.1, 'model__max_depth': 5, 'model__n_estimators': 200}
Test Accuracy: 0.9844
Test Classification Report:
precision recall f1-score support
0 0.991317 0.992158 0.991737 4718.0000
1 0.866906 0.854610 0.860714 282.0000
accuracy 0.984400 0.984400 0.984400 0.9844
macro avg 0.929112 0.923384 0.926226 5000.0000
weighted avg 0.984300 0.984400 0.984348 5000.0000
Test Confusion Matrix:
[[4681 37]
[ 41 241]]
AdaBoost
Best Parameters: {'model__estimator': DecisionTreeClassifier(max_depth=3), 'model__learning_rate': 0.1, 'model__n_estimators': 100}
Test Accuracy: 0.9676
Test Classification Report:
precision recall f1-score support
0 0.990737 0.974777 0.982692 4718.0000
1 0.667598 0.847518 0.746875 282.0000
accuracy 0.967600 0.967600 0.967600 0.9676
macro avg 0.829167 0.911148 0.864784 5000.0000
weighted avg 0.972512 0.967600 0.969392 5000.0000
Test Confusion Matrix:
[[4599 119]
[ 43 239]]
Bagging
Best Parameters: {'model__base_estimator': DecisionTreeClassifier(max_depth=3), 'model__n_estimators': 50}
Test Accuracy: 0.791
Test Classification Report:
precision recall f1-score support
0 0.987006 0.788894 0.876900 4718.000
1 0.189585 0.826241 0.308405 282.000
accuracy 0.791000 0.791000 0.791000 0.791
macro avg 0.588296 0.807567 0.592652 5000.000
weighted avg 0.942032 0.791000 0.844836 5000.000
Test Confusion Matrix:
[[3722 996]
[ 49 233]]
Random Forest
Best Parameters: {'model__max_depth': 5, 'model__n_estimators': 100}
Test Accuracy: 0.949
Test Classification Report:
precision recall f1-score support
0 0.990979 0.954642 0.972471 4718.000
1 0.529670 0.854610 0.654003 282.000
accuracy 0.949000 0.949000 0.949000 0.949
macro avg 0.760325 0.904626 0.813237 5000.000
weighted avg 0.964961 0.949000 0.954510 5000.000
Test Confusion Matrix:
[[4504 214]
[ 41 241]]
Logistic Regression
Best Parameters: {'model__C': 0.01, 'model__solver': 'saga'}
Test Accuracy: 0.8746
Test Classification Report:
precision recall f1-score support
0 0.986908 0.878762 0.929701 4718.0000
1 0.284105 0.804965 0.419981 282.0000
accuracy 0.874600 0.874600 0.874600 0.8746
macro avg 0.635507 0.841863 0.674841 5000.0000
weighted avg 0.947270 0.874600 0.900952 5000.0000
Test Confusion Matrix:
[[4146 572]
[ 55 227]]
XGBoost
Best Parameters: {'model__learning_rate': 0.1, 'model__max_depth': 5, 'model__n_estimators': 200}
Test Accuracy: 0.984
Test Classification Report:
precision recall f1-score support
0 0.990898 0.992158 0.991527 4718.000
1 0.865942 0.847518 0.856631 282.000
accuracy 0.984000 0.984000 0.984000 0.984
macro avg 0.928420 0.919838 0.924079 5000.000
weighted avg 0.983850 0.984000 0.983919 5000.000
Test Confusion Matrix:
[[4681 37]
[ 43 239]]
Observation
1. Model Performance Analysis
Based on the performance of the classification models, the following insights can be drawn:
Gradient Boosting and XGBoost: These models demonstrate the highest test accuracy (approximately 98.4%). They also provide strong precision and recall for predicting failures, which indicates their effectiveness in minimizing both false positives and false negatives. For the company, this means these models are highly reliable for predicting generator failures, thus potentially reducing both inspection and replacement costs.
Random Forest: This model also performs well, with a test accuracy of around 94.9%. Its recall on the failure class (≈0.85) matches the top performers, but its precision (≈0.53) is noticeably lower, meaning roughly half of its flagged failures are false alarms. Since inspections are cheap relative to replacements, it remains a viable alternative to Gradient Boosting and XGBoost for reducing maintenance costs.
AdaBoost: While its accuracy (around 96.8%) is slightly lower than the top performers', it still offers a reasonable balance between precision and recall, though its failure-class precision (≈0.67) means more false alarms. It could serve as a fallback if Gradient Boosting and XGBoost are not feasible, for example due to computational limitations.
Logistic Regression: This model has the second-lowest accuracy among the tested models (approximately 87.5%) and the lowest recall on the failure class (≈0.80). It shows good precision for non-failure cases but struggles to detect failures effectively. Given the higher cost associated with false negatives (missed failures), this model is not well suited for this application.
Bagging: This model exhibits the lowest accuracy, around 79.1%. Although its failure-class recall (≈0.83) is respectable, its precision on that class is very poor (≈0.19): it raises nearly 1,000 false alarms on the test set, which would drive up inspection costs considerably. Its use should be limited, especially when other models are available.
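The precision/recall trade-off on the failure class is easier to see side by side. The sketch below tabulates the class-1 figures transcribed from the classification reports above (rounded to four decimals):

```python
import pandas as pd

# Per-model precision and recall for the failure class (label 1),
# transcribed from the test classification reports above
failure_metrics = {
    'Gradient Boosting':   {'precision': 0.8669, 'recall': 0.8546},
    'XGBoost':             {'precision': 0.8659, 'recall': 0.8475},
    'AdaBoost':            {'precision': 0.6676, 'recall': 0.8475},
    'Random Forest':       {'precision': 0.5297, 'recall': 0.8546},
    'Logistic Regression': {'precision': 0.2841, 'recall': 0.8050},
    'Bagging':             {'precision': 0.1896, 'recall': 0.8262},
}

# Rank models by how trustworthy their failure alerts are
summary = pd.DataFrame(failure_metrics).T.sort_values('precision', ascending=False)
print(summary)
```

Note that every model achieves failure recall above 0.80; what separates them is precision, i.e. how many of their alerts are false alarms.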
2. Cost Implications
Minimizing Replacement Costs: Accurate failure prediction (true positives) helps in early repairs, avoiding the much higher replacement costs. Models like Gradient Boosting and XGBoost, with high accuracy and recall for failures, will be crucial in minimizing these replacement costs.
Inspection vs. Repair Costs: Although false positives (incorrectly predicting a failure) incur inspection costs, these are lower compared to repair and replacement costs. Therefore, models with slightly higher false positive rates (like AdaBoost and Random Forest) could still be considered if they offer substantial reductions in replacement costs.
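To make the trade-off concrete, the sketch below scores two confusion matrices from above with hypothetical relative unit costs (inspection = 1, repair = 5, replacement = 20 are illustrative assumptions, not figures from the problem statement, which only fixes the ordering inspection < repair < replacement):

```python
# Hypothetical relative unit costs (assumed for illustration only)
INSPECT, REPAIR, REPLACE = 1, 5, 20

def maintenance_cost(tp, fp, fn):
    """Total relative cost implied by one confusion matrix:
    every predicted failure is inspected, true failures are then
    repaired, and missed failures (false negatives) force a replacement."""
    return (tp + fp) * INSPECT + tp * REPAIR + fn * REPLACE

# Confusion-matrix counts taken from the test results above
costs = {
    'Gradient Boosting': maintenance_cost(tp=241, fp=37, fn=41),
    'Bagging':           maintenance_cost(tp=233, fp=996, fn=49),
}
print(costs)
```

Under these assumed unit costs, Gradient Boosting is markedly cheaper than Bagging even though their recall is similar, because Bagging's false alarms multiply the inspection bill.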
3. Model Selection Recommendation
Primary Recommendation: Gradient Boosting or XGBoost should be considered as the primary models for deployment due to their superior performance in accuracy and recall for failure prediction. This will lead to significant cost savings by reducing the number of undetected failures and, consequently, replacement costs.
Secondary Recommendation: Random Forest can be used as a backup model if computational resources are a concern or if further tuning is required. It provides a good balance of performance and cost efficiency.
Avoid: Logistic Regression and Bagging should be avoided in favor of models that offer better overall performance and cost-saving potential.
4. Next Steps
Deployment and Monitoring: Implement the chosen model(s) and continuously monitor their performance to ensure they maintain effectiveness. Adjustments and re-training may be necessary as new data becomes available.
Cost-Benefit Analysis: Perform a detailed cost-benefit analysis to quantify the financial impact of implementing the selected model(s) and to validate the expected savings in maintenance and operational costs.
Model Refinement: Consider further tuning and refinement of the models to enhance their performance and adapt to any changes in the operational environment or data distribution.
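One concrete refinement: the grid search above optimized overall accuracy, but since missed failures are the expensive outcome, re-tuning with recall as the scoring metric may suit the cost structure better. A minimal sketch, using a small synthetic imbalanced dataset as a stand-in for the confidential sensor data and a simpler pipeline than the one above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced stand-in (~10% positives, like the failure class)
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000, random_state=42)),
])

# scoring='recall' targets the failure class instead of overall accuracy
grid = GridSearchCV(pipe, {'model__C': [0.01, 0.1, 1]},
                    cv=5, scoring='recall', n_jobs=-1)
grid.fit(X_tr, y_tr)
print(grid.best_params_, grid.best_score_)
```

The same `scoring='recall'` switch (or a custom scorer weighting false negatives by their cost) can be dropped into the full pipeline and model grids used earlier.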